Polygenic Risk Score


Integrating Genomics into Multimodal EHR Foundation Models

Amar, Jonathan, Liu, Edward, Breschi, Alessandra, Zhang, Liangliang, Kheradpour, Pouya, Li, Sylvia, Lehmann, Lisa Soleymani, Giulianelli, Alessandro, Edwards, Matt, Jia, Yugang, Nola, David, Mani, Raghav, Vats, Pankaj, Tetreault, Jesse, Chen, T. J., McLean, Cory Y.

arXiv.org Artificial Intelligence

This paper introduces an innovative Electronic Health Record (EHR) foundation model that integrates Polygenic Risk Scores (PRS) as a foundational data modality, moving beyond traditional EHR-only approaches to build more holistic health profiles. Leveraging the extensive and diverse data from the All of Us (AoU) Research Program, this multimodal framework aims to learn complex relationships between clinical data and genetic predispositions. The methodology extends advancements in generative AI to the EHR foundation model space, enhancing predictive capabilities and interpretability. Evaluation on AoU data demonstrates the model's predictive value for the onset of various conditions, particularly Type 2 Diabetes (T2D), and illustrates the interplay between PRS and EHR data. The work also explores transfer learning for custom classification tasks, showcasing the architecture's versatility and efficiency. This approach is pivotal for unlocking new insights into disease prediction, proactive health management, risk stratification, and personalized treatment strategies, laying the groundwork for more personalized, equitable, and actionable real-world evidence generation in healthcare.
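The entries on this page all build on polygenic risk scores. A PRS is simply a weighted sum of an individual's risk-allele dosages; a minimal sketch (the variant IDs, effect weights, and genotypes below are illustrative placeholders, not values from any of the papers):

```python
# Minimal polygenic risk score: the sum of per-allele effect sizes times
# allele dosages. Weights and genotypes here are illustrative placeholders.

def polygenic_risk_score(dosages, weights):
    """dosages: variant_id -> allele count (0, 1, or 2);
    weights: variant_id -> per-allele effect size (e.g. a log odds ratio).
    Variants missing from the genotype data contribute zero."""
    return sum(weights[v] * dosages.get(v, 0) for v in weights)

weights = {"rs1": 0.12, "rs2": -0.05, "rs3": 0.30}   # hypothetical effect sizes
dosages = {"rs1": 2, "rs2": 1, "rs3": 0}             # hypothetical genotypes

print(round(polygenic_risk_score(dosages, weights), 4))  # 2*0.12 + 1*(-0.05) = 0.19
```

Real scores are computed over hundreds to millions of variants (PRS313 below uses 313), but the arithmetic is exactly this.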


The race to make the perfect baby is creating an ethical mess

MIT Technology Review

A new field of science claims to be able to predict aesthetic traits, intelligence, and even moral character in embryos. Is this the next step in human evolution or something more dangerous? Consider, if you will, the translucent blob in the eye of a microscope: a human blastocyst, the biological specimen that emerges just five days or so after a fateful encounter between egg and sperm. This bundle of cells, about the size of a grain of sand pulled from a powdery white Caribbean beach, contains the coiled potential of a future life: 46 chromosomes, thousands of genes, and roughly six billion base pairs of DNA--an instruction manual to assemble a one-of-a-kind human. Now imagine a laser pulse snipping a hole in the blastocyst's outermost shell so a handful of cells can be suctioned up by a microscopic pipette. This is the moment, thanks to advances in genetic sequencing technology, when it becomes possible to read virtually that entire instruction manual. An emerging field of science seeks to use the analysis pulled from that procedure to predict what kind of a person that embryo might become. Some parents turn to these tests to avoid passing on devastating genetic disorders that run in their families. A much smaller group, driven by dreams of Ivy League diplomas or attractive, well-behaved offspring, are willing to pay tens of thousands of dollars to optimize for intelligence, appearance, and personality. Some of the most eager early boosters of this technology are members of the Silicon Valley elite, including tech billionaires like Elon Musk, Peter Thiel, and Coinbase CEO Brian Armstrong. Embryo selection is less like a build-a-baby workshop and more akin to a store where parents can shop for their future children from several available models--complete with stat cards. But customers of the companies emerging to provide it to the public may not be getting what they're paying for. Genetics experts have been highlighting the potential deficiencies of this testing for years.


FastImpute: A Baseline for Open-source, Reference-Free Genotype Imputation Methods -- A Case Study in PRS313

Ge, Aaron, Balasubramanian, Jeya, Wu, Xueyao, Kraft, Peter, Almeida, Jonas S.

arXiv.org Artificial Intelligence

Genotype imputation enhances genetic data by predicting missing SNPs using reference haplotype information. Traditional methods leverage linkage disequilibrium (LD) to infer untyped SNP genotypes, relying on the similarity of LD structures between genotyped target sets and fully sequenced reference panels. Recently, reference-free deep learning-based methods have emerged, offering a promising alternative by predicting missing genotypes without external databases, thereby enhancing privacy and accessibility. However, these methods often produce models with tens of millions of parameters, leading to challenges such as the need for substantial computational resources to train and inefficiency for client-sided deployment. Our study addresses these limitations by introducing a baseline for a novel genotype imputation pipeline that supports client-sided imputation models generalizable across any genotyping chip and genomic region. This approach enhances patient privacy by performing imputation directly on edge devices. As a case study, we focus on PRS313, a polygenic risk score comprising 313 SNPs used for breast cancer risk prediction. Utilizing consumer genetic panels such as 23andMe, our model democratizes access to personalized genetic insights by allowing 23andMe users to obtain their PRS313 score. We demonstrate that simple linear regression can significantly improve the accuracy of PRS313 scores when calculated using SNPs imputed from consumer genetic panels such as 23andMe. Our linear regression model achieved an R^2 of 0.86, compared to 0.33 without imputation and 0.28 with simple imputation (substituting missing SNPs with the minor allele frequency). These findings suggest that popular SNP analysis libraries could benefit from integrating linear regression models for genotype imputation, providing a viable and lightweight alternative to reference-based imputation.
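The core idea above can be sketched with per-SNP linear regression: predict an untyped SNP's dosage from the SNPs a consumer chip does type, exploiting their LD-driven correlation. This is a toy illustration on synthetic data, not the authors' pipeline; the high held-out R^2 only reflects the strong simulated correlation.

```python
# Sketch of reference-free imputation via linear regression: predict an
# untyped SNP's dosage from SNPs the chip does type. Data is synthetic;
# the real pipeline targets the 313 PRS313 SNPs.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1000
typed = rng.integers(0, 3, size=(n, 5)).astype(float)  # dosages of typed SNPs
# Simulate an untyped SNP correlated with the typed ones (stand-in for LD):
untyped = 0.8 * typed[:, 0] + 0.2 * typed[:, 1] + rng.normal(0, 0.1, n)

# Train on half the cohort, evaluate held-out accuracy on the other half.
model = LinearRegression().fit(typed[:500], untyped[:500])
pred = model.predict(typed[500:])
r2 = model.score(typed[500:], untyped[500:])
print(f"held-out R^2 = {r2:.2f}")
```

In the paper's setting one such regression is fit per missing PRS313 SNP, and the imputed dosages then feed the weighted-sum PRS computation.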


AI Machine Learning Predicts Alzheimer's Disease Risk

#artificialintelligence

The most common cause of dementia worldwide is Alzheimer's disease (AD), a neurodegenerative disorder with no known cure. A new study published in Scientific Reports uses artificial intelligence (AI) machine learning (ML) and data from electronic health records (EHRs) to identify the important predictors for Alzheimer's disease and finds that a person's genetics outperforms age as a predictor for individuals who are 65 years of age and older. "Machine learning (ML) methods provide an attractive and effective alternative to traditional statistical regression models, especially in situations where one has a large number of features or predictors," wrote the authors of the National Institutes of Health (NIH)-funded study led by Xiaoyi Raymond Gao at The Ohio State University College of Medicine, with Ohio State researchers Marion Chiariglione, Ke Qin and Douglas Scharre; University of Miami researchers Karen Nuytemans and Eden Martin; and Yi-Ju Li at Duke University. Globally, Alzheimer's disease accounts for an estimated 60-70 percent of the over 55 million people with dementia and affects women disproportionately, according to the World Health Organization (WHO). In the U.S., there are currently 6.7 million people aged 65 and older living with AD, of whom almost two-thirds are women, and that figure will increase significantly to an estimated 12.7 million Americans by 2050, according to the Alzheimer's Association.


Machine Learning Models Rank Predictive Risks for Alzheimer's Disease - Neuroscience News

#artificialintelligence

Summary: Using machine learning technology, researchers concluded that genetic risk may outweigh age as a predictor of whether a person will develop Alzheimer's disease. Once adults reach age 65, the threshold age for the onset of Alzheimer's disease, the extent of their genetic risk may outweigh age as a predictor of whether they will develop the fatal brain disorder, a new study suggests. The study, published recently in the journal Scientific Reports, is the first to construct machine learning models with genetic risk scores, non-genetic information and electronic health record data from nearly half a million individuals to rank risk factors in order of how strong their association is with eventual development of Alzheimer's disease. Researchers used the models to rank predictive risk factors for two populations from the UK Biobank: White individuals aged 40 and older, and a subset of those adults who were 65 or older. Results showed that age – which constitutes one-third of total risk by age 85, according to the Alzheimer's Association – was the biggest risk factor for Alzheimer's in the entire population, but for the older adults, genetic risk as determined by a polygenic risk score was more predictive.


Machine learning models rank predictive risks for Alzheimer's disease

#artificialintelligence

Once adults reach age 65, the threshold age for the onset of Alzheimer's disease, the extent of their genetic risk may outweigh age as a predictor of whether they will develop the fatal brain disorder, a new study suggests. The study, published recently in the journal Scientific Reports, is the first to construct machine learning models with genetic risk scores, non-genetic information and electronic health record data from nearly half a million individuals to rank risk factors in order of how strong their association is with eventual development of Alzheimer's disease. Researchers used the models to rank predictive risk factors for two populations from the UK Biobank: White individuals aged 40 and older, and a subset of those adults who were 65 or older. Results showed that age – which constitutes one-third of total risk by age 85, according to the Alzheimer's Association – was the biggest risk factor for Alzheimer's in the entire population, but for the older adults, genetic risk as determined by a polygenic risk score was more predictive. "We all know Alzheimer's disease is a later-onset disease, so we know age is an important risk factor. But when we consider risk only for people age 65 or older, then genetic information captured by a polygenic risk score ranks higher than age," said lead study author Xiaoyi Raymond Gao, associate professor of ophthalmology and visual sciences and of biomedical informatics in The Ohio State University College of Medicine.


Explainable machine learning aggregates polygenic risk scores and electronic health records for Alzheimer's disease prediction

#artificialintelligence

Alzheimer’s disease (AD) is the most common late-onset neurodegenerative disorder. Identifying individuals at increased risk of developing AD is important for early intervention. Using data from the Alzheimer Disease Genetics Consortium, we constructed polygenic risk scores (PRSs) for AD and age-at-onset (AAO) of AD for the UK Biobank participants. We then built machine learning (ML) models for predicting development of AD, and explored feature importance among PRSs, conventional risk factors, and ICD-10 codes from electronic health records, a total of > 11,000 features using the UK Biobank dataset. We used eXtreme Gradient Boosting (XGBoost) and SHapley Additive exPlanations (SHAP), which provided superior ML performance as well as aided ML model explanation. For participants age 40 and older, the area under the curve for AD was 0.88. For subjects of age 65 and older (late-onset AD), PRSs were the most important predictors. This is the first observation that PRSs constructed from the AD risk and AAO play more important roles than age in predicting AD. The ML model also identified important predictors from EHR, including urinary tract infection, syncope and collapse, chest pain, disorientation and hypercholesterolemia, for developing AD. Our ML model improved the accuracy of AD risk prediction by efficiently exploring numerous predictors and identified novel feature patterns.
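The pipeline this abstract describes can be sketched as: fit gradient-boosted trees on PRS, age, and EHR-derived features, then rank the features by importance. The paper uses XGBoost and SHAP; this sketch swaps in scikit-learn's gradient boosting and permutation importance to stay dependency-light, and all data is synthetic.

```python
# Sketch: gradient-boosted trees over PRS + clinical features, then a
# feature-importance ranking. Stand-ins for the paper's XGBoost + SHAP:
# sklearn GradientBoostingClassifier + permutation importance. Synthetic data.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 2000
age = rng.uniform(40, 90, n)
prs = rng.normal(0, 1, n)            # standardized polygenic risk score
icd_flag = rng.integers(0, 2, n)     # e.g. a binary EHR diagnosis-code flag

# Synthetic outcome where PRS dominates, age contributes, the flag adds little.
logit = 1.5 * prs + 0.03 * (age - 65) + 0.2 * icd_flag - 1.0
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([age, prs, icd_flag])
clf = GradientBoostingClassifier(random_state=0).fit(X, y)
imp = permutation_importance(clf, X, y, n_repeats=5, random_state=0)
for name, score in zip(["age", "PRS", "icd_flag"], imp.importances_mean):
    print(f"{name}: {score:.3f}")
```

On this synthetic cohort the PRS feature ranks first, mirroring the study's finding for the 65-and-older subgroup; with real UK Biobank data the ranking is of course an empirical result, not built in.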


Adaptive Sampling Strategies to Construct Equitable Training Datasets

Cai, William, Encarnacion, Ro, Chern, Bobbie, Corbett-Davies, Sam, Bogen, Miranda, Bergman, Stevie, Goel, Sharad

arXiv.org Artificial Intelligence

In domains ranging from computer vision to natural language processing, machine learning models have been shown to exhibit stark disparities, often performing worse for members of traditionally underserved groups. One factor contributing to these performance gaps is a lack of representation in the data the models are trained on. It is often unclear, however, how to operationalize representativeness in specific applications. Here we formalize the problem of creating equitable training datasets, and propose a statistical framework for addressing this problem. We consider a setting where a model builder must decide how to allocate a fixed data collection budget to gather training data from different subgroups. We then frame dataset creation as a constrained optimization problem, in which one maximizes a function of group-specific performance metrics based on (estimated) group-specific learning rates and costs per sample. This flexible approach incorporates preferences of model-builders and other stakeholders, as well as the statistical properties of the learning task. When data collection decisions are made sequentially, we show that under certain conditions this optimization problem can be efficiently solved even without prior knowledge of the learning rates. To illustrate our approach, we conduct a simulation study of polygenic risk scores on synthetic genomic data -- an application domain that often suffers from non-representative data collection. We find that our adaptive sampling strategy outperforms several common data collection heuristics, including equal and proportional sampling, demonstrating the value of strategic dataset design for building equitable models.
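The sequential allocation problem can be illustrated with a toy greedy rule: assume each group's error follows a power-law learning curve and assign each batch of samples to the group whose error the batch reduces most. The curves and the greedy rule are simplifications for illustration, not the paper's algorithm.

```python
# Toy budgeted data collection across subgroups: each group's error is
# modeled as err = a * n**(-b) (a power-law learning curve), and each batch
# goes to the group with the largest marginal error reduction.

def allocate(budget, curves, batch=10):
    """curves: list of (a, b) per group; returns samples allocated per group."""
    n = [1] * len(curves)  # seed each group with one sample to avoid n = 0
    err = lambda g, m: curves[g][0] * m ** (-curves[g][1])
    spent = len(curves)
    while spent + batch <= budget:
        # marginal error reduction from giving each group one more batch
        gains = [err(g, n[g]) - err(g, n[g] + batch) for g in range(len(curves))]
        g = max(range(len(curves)), key=gains.__getitem__)
        n[g] += batch
        spent += batch
    return n

# Group 1 learns more slowly (smaller exponent b), so under this rule it
# ends up receiving most of the budget; equal sampling would give 50/50.
n = allocate(1000, [(1.0, 0.6), (1.0, 0.3)])
print(n)
```

With these made-up curves the slow-learning group receives roughly twice the samples, illustrating the paper's point that equal or proportional sampling can under-serve hard-to-learn groups.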


Epigenetic Health Monitoring to Reduce Your Future Illness Risk – EP13: Tom Stubbs (Chronomics) – Hyper Wellbeing Innovation Labs, Inc. Blog

#artificialintelligence

In this thirteenth episode, Tom Stubbs, Co-Founder/CEO of Chronomics, starts by introducing epigenetics. He describes the technology and expertise he has brought together to create the only company in the world advancing the forefront of epigenetic biomarkers, and explains how their AI-based health biomarker engine will be used to reduce your risk of future illness.

Tom: Thank you for having me on the show. Pleasure to be here and looking forward to chatting with you. We are very much focused on measuring health so people can avoid disease.

Lee: Measuring health so that people can avoid disease, that sounds a little bit cryptic.

Tom: I mean, essentially we're focused on providing people with objective measures that capture the broader definition of health. So not merely health being the absence of disease, but actually, as defined by the World Health Organization over 70 years ago, health being the complete physical, mental and social wellbeing of a person. And we think that this is super important, because with the rise of aging populations and the growth in chronic conditions globally, such as heart disease and type two diabetes, there's a growing need for healthcare to shift towards prevention. And to enable this shift, we need measures to capture the largest risk factors for these conditions ahead of time so that people can prevent through action.

Lee: So I think I was one of the first users of Chronomics. I had contacted yourselves at the end of 2018 and took a whole genome sequence and an epigenetic test.

Tom: We first were putting the product out in 2018, and yes, you were among the first users of the product. Pleasure to have had you, and still have you, as a customer, Lee.

Lee: And I remember yourselves very favorably, because I was a little bit skeptical. Tommy Woods had informed me that the business model of quite a few companies in the OMICS space is to give you a large questionnaire and apply AI to it, and I've now had it demonstrated to me that, based on a simple questionnaire, AI can derive a lot of predictive health information about you, in some cases way more than the OMICS can. And these companies are doing this heavy OMICS data acquisition not so much to give you information at the moment, but so that in 5 or 10 years they have a vast sum of data they can do something with. And so, I was skeptical that Chronomics might be doing that, and I said, please make a special case for me: give me my results without the questionnaire.

Tom: Yeah, I do remember this, Lee.

Lee: And then I said, hey look, if I'm doing a whole genome sequence, I actually want a copy of it. So send me every letter.